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Abstract 

This paper considers the problem of recovering an unknown sparse p x p matrix X 
from an m X m matrix Y = AXB^ ^ where A and B are known m x p matrices with 
m <^ p. 

The main result shows that there exist constructions of the "sketching" matrices A 
and B so that even if X has 0{p) non-zeros, it can be recovered exactly and efficiently 
using a convex program as long as these non-zeros are not concentrated in any single 
row/column of X. Furthermore, it suffices for the size of Y (the sketch dimension) to 
scale as m = O (y^ # nonzeros in X x logp) . The results also show that the recovery is 
robust and stable in the sense that if X is equal to a sparse matrix plus a perturbation, 
then the convex program we propose produces an approximation with accuracy pro- 
portional to the size of the perturbation. Unlike traditional results on sparse recovery, 
where the sensing matrix produces independent measurements, our sensing operator is 
highly constrained (it assumes a tensor product structure). Therefore, proving recov- 
ery guarantees require non-standard techniques. Indeed our approach relies on a novel 
result concerning tensor products of bipartite graphs, which may be of independent 
interest. 

This problem is motivated by the following application, among others. Consider 
a p X n data matrix D, consisting of n observations of p variables. Assume that the 
correlation matrix X := DD^ is (approximately) sparse in the sense that each of the p 
variables is significantly correlated with only a few others. Our results show that these 
significant correlations can be detected even if we have access to only a sketch of the 
data S = AD with A G 

Keywords, sketching, tensor products, distributed sparsity, £i minimization, compressed 
sensing, covariance sketching, graph sketching, multi-dimensional signal processing. 



1 Introduction 

An important feature of many modern data analysis problems is the presence of a large 
number of variables relative to the amount of available resources. Such high dimensionality 
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occurs in a range of applications in bioinformatics, climate studies, and economics. Accord- 
ingly, a fruitful and active research agenda over the last few years has been the development 
of methods for sampling, estimation, and learning that take into account structure in the 
underlying model and thereby making these problems tractable. A notion of structure that 
has seen many applications is that of sparsity^ and methods for sampling and estimating 
sparse signals have been the subject of intense research in the past few years [11, 10, 20] 

In this paper we will study a more nuanced notion of structure which we call distributed 
sparsity. For what follows, it will be convenient to think of the unknown high-dimensional 
signal of interest as being represented as a matrix X. Roughly, the signal is said to be 
distributed sparse if every row and every column of X has only a few non-zeros. We will see 
that it is possible to design efficient and effective acquisition and estimation mechanisms for 
such signals. Let us begin by considering a few example scenarios where one might encounter 
distributed sparsity. 

• Covariance Matrices: Covariance matrices associated to some natural phenomena 
have the property that each covariate is correlated with only a few other covariates. 
For instance, it is observed that protein signaling networks are such that there are 
only a few significant correlations [45] and hence the discovery of the such networks 
from experimental data naturally leads to the estimation of a covariance matrix (where 
the covariates are proteins) which is (approximately) distributed sparse. Similarly, the 
covariance structure corresponding to longitudinal data is distributed sparse [ ]. See 
Section 1.3.1. 

• Multi-dimensional signals: Multi-dimensional signals such as the natural images 
that arise in medical imaging [ ] are known to be sparse in the gradient domain. When 
the features in the images are not axis-aligned, not only is the matrix representation 
of the image gradient sparse, it is also distributed sparse. For a little more on this, see 
Section 1.3.3 

• Random Sparse Signals and Random Graphs: Signals where the sparsity pattern 
is random (i.e., each entry is nonzero independently and with a probability q) are also 
distributed sparse with high probability. The "distributedness" of the sparsity pattern 
can be measured using the "degree of sparsity" d which is defined to be the maximum 
number of non-zeros in any row or column. For random sparsity patterns, we have the 
following: 

Proposition 1. Consider a random matrix X G whose entries are independent 

copies of the Bernoulli{'^) ^distribution where = A = B(l). Then for any e > 0^ X 
has at most d Vs in each row /column with probability at least I — e, where 




^Recall that if X ~ Bernoulli(7), then P(x = 1) = 7 and P(x = 0) = 1 — 7. 
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(The proof is straightforward, and available in Appendix B.) 

In a similar vein, combinatorial graphs have small degree in a variety of applications, 
and their corresponding matrix representation will then be distributed sparse. For 
instance, Erdos-Renyi random graphs G{p^ q) with pq = 0{logp) have small degree 

1.1 Problem Setup and Main Results 

Our goal is to invert an underdetermined linear system of the form 

Y = AXB^, (1) 

where A = [aij] G W^'^p.B = [bij] G M"^^^, with m < p and X G Since the matrix 

X G MP^P is linearly transformed to obtain the smaller dimensional matrix Y G MJ^^^^ we 
will refer to Y as the sketch (borrowing terminology from the computer science literature 
[40]) of X and we will refer to the quantity m as the sketching dimension. Since the value 
of m signifies the amount of compression achieved, it is desirable to have as small a value of 
m as possible. 

Rewriting the above using tensor product notation, with y = vec(y) and x = vec(X), 
we equivalently have 

y = {B®A)x, (2) 

where vec(X) simply vectorizes the matrix X, i.e., produces a long column vector by stacking 
the columns of the matrix and B(^Ai8 the tensor (or Kronecker) product of B and A^ given 
by 

buA buA ••• bipA 

b2lA b22A ••• b2pA 

::•.:* 

_ bmlA bm2A • • • bmpA _ 

While it is not possible to invert such underdetermined systems of equations in gen- 
eral, the rapidly growing literature on what has come to be known as compressed sensing 
suggests that this can be done under certain assumptions. In particular, taking cues from 
this literature, one might think that this is possible if x (or equivalently X) has only a few 
non-zeros. 

Let us first consider the case when there are only k = Q{1) non-zeros in X, i.e., it is very 
sparse. Then, it is possible to prove that the optimization program (Pi) recovers X from 
AXB^ using standard "RIP-based" techniques [11]. We refer the interested reader to the 
papers by Jokar et al [ ] and Duarte et al [21] for more details, but in essence the authors 
show that if 5r{A) and 5r{B) are the restricted isometry constants (of order r) [ ] for ^ and 
B respectively, then the following is true about B <S) A 

md.x{6r{A),6r{B)} < Sr{A®B) = Sr{B®A) < {l + 6r{A)) {l + Sr{B)) - 1. 

In many interesting problems that arise naturally, as we will see in subsequent sections, 
a more realistic assumption to make is that X has 0{p) non-zeros and it is this setting we 
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consider for this paper. Unfortunately, the proof techniques outhned above cannot succeed 
in such a demanding scenario. As hinted earher, it wiU turn out however that one cannot 
handle arbitrary sparsity patterns and that the non-zero pattern of X needs to be distributed^ 
i.e., each row/column of X cannot have more than a few, say rf, non-zeros. We will call such 
matrices rf— distributed sparse (see Definition 3). We explore this notion of structure in more 
detail in Section 3.1. 

An obvious, albeit highly impractical, approach to recover a (distributed) sparse X from 
measurements of the form Y = AXB^ is the following: search over all matrices X G MP^^ 
such that AXB^ agrees with Y = AXB^ and find the sparsest one. One might hope that 
under reasonable assumptions, such a procedure would return X as the solution. However, 
there is no guarantee that this approach might work and worse still, such a search procedure 
is known to be computationally infeasible. 

We instead consider solving the optimization program (Pi) which is a natural (convex) 
relaxation of the above approach. 



mmimize 

X 



X 
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subject to AXB^ = Y. 



(Pi 



Here, by ||X||i we mean ^ij ^ i-^-^ the ii norm of vec(X). 

The main part of the paper is devoted to showing with high probability (Pi) has a unique 
solution that equals X. In particular, we prove the following result. 

Theorem 1. Suppose that X is d— distributed sparse. Also^ suppose that A^B G {0, 

are drawn independently and uniformly from the 6— random bipartite ensemble^. Then as 

long as 

m = 0{^/dplogp) and 5 = (^(logp), 

there exists a c > such that the optimal solution X* of (Pi) equals X with probability 
exceeding 1 —p~^. Furthermore^ this holds even if B equals A. 

In Section 4, we will prove Theorem 1 for the case when B = A. It is quite straightforward 
to modify this proof to the case where A and B are independently drawn, since there is much 
more independence that can be leveraged. 

Let us pause here and consider some implications of this theorem. 

1. (Pi) does not impose any structural restrictions on X*. In other words, even though 
X is assumed to be distributed sparse, this (highly non-convex) constraint need not be 
factored in to the optimization problem. This ensures that (Pi) is a Linear Program 
(see e.g., [ ]) and can thus be solved eflSciently. 



^Roughly speaking, the (5— random bipartite ensemble consists of the set of all 0-1 matrices that have 
almost exactly S ones per column. We refer the reader to Definition 4 for the precise definition and Section 3.2 
for more details. 



4 



2. Recall that what we observe can be thought of as the vector 

{B ® A)x. Since X is rf— distributed sparse, x has 0{dp) non-zeros. Now, even if 
an oracle were to reveal the exact locations of these non-zeros, we would require at 
least 0{dp) measurements to be able to perform the necessary inversion to recover 
X. In other words, it is absolutely necessary for to be at least 0{dp). Comparing 
this to Theorem 1 shows that the simple algorithm we propose is near optimal in the 
sense that it is only a logarithm away from this trivial lower bound. This logarithmic 
factor also makes an appearance in the measurement bounds in the compressed sensing 
literature [^^]. 

3. Finally, as mentioned earlier, inversion of under-determined linear systems where the 
linear operator assumes a tensor product structure has been studied earlier [ , ]. 
However, these methods are relevant only in the regime where the sparsity of the 
signal to be recovered is much smaller than the dimension p. The proof techniques 
they employ will unfortunately not allow one to handle the more demanding situation 
the sparsity scales linearly in p and if one attempted an extension of their techniques 
naively to this situation, one would see that the sketch size m needs to scale like 
0{dp\o^ p) in order to recover a d— distributed sparse matrix X. This is of course 
uninteresting since it would imply that the size of the sketch is bigger than the size of 
X. 

It is possible that Y was not exactly observed, but rather is only available to us as a cor- 
rupted version Y . For instance, Y could be Y corrupted by independent zero mean, additive 
Gaussian noise or in case of the covariance sketching problem discussed in Section 1.3.1, Y 
could be an empirical estimate of the covariance matrix Y = AXA^. In both these cases, a 
natural relaxation to (Pi) would be the following optimization program (P2) (with B set to 
A in the latter case). 

(P2) 

Notice that if X was a sparse covariance matrix and if A = B = Ipxp^ then (P2) reduces to 
the soft thresholding estimator of sparse covariance matrices studied in [ ] . 

While our experimental results show that this optimization program (P2) performs well, 
we leave its exploration and analysis to future work. We will instead state the following 
''approximation" result that shows that the solution of (Pi) is close to the optimal d— 
distributed sparse approximation for any matrix X. The proof is similar to the proof of 
Theorem 3 in [ ] and is provided in Appendix C. Given p G N, let [p] denote the set 
{1, 2, . . . and let Wd,p denote the following collection of subsets of [p] x [p]: 

m,p ■■= C [p] X [p] : \n n {{i} x[p]}\<d,\nn {[p] x {i}}\ < d, for all i G [p]} . 

Notice that if a matrix X G is such that there exists an G Wd,p with the property 

that Xij 7^ only if (i,j) G f^, then the matrix is rf— distributed sparse. 



minimize \\AXB^ -Y\\l + X\\X\ 

if 
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Given Q G Wd^p and a matrix X G M^^^, we write Xq to denote the projection of X onto 
the set of aU matrices supported on Q. That is, 



[Xn] 



x,^j if G n 







otherwise 



for all (i^j) G [p] x [p]. 



Theorem 2. Suppose that X is an arbitrary p x p matrix and that the hypotheses of The- 
orem 1 hold. Let X* denote the solution to the optimization program (Pi). Then, there 
exist constants c > and e G (0, 1/4) such that the following holds with probability exceeding 
1 — p~^. 

^ ~ min IIX - X 



X' 



X L < I mm 

^ 1-4:6 \neWd,p 



(4) 



The above theorem tells us that even if X is not structured in any way, the solution of 
the optimization program (Pi) approximates X as well as the best possible rf— distributed 
sparse approximation of X (up to a constant factor). This has interesting implications, for 
instance, to situations where a rf— distributed sparse X is corrupted by a "noise" matrix X 
as shown in the following corollary. 



Corollary 1. Suppose X G is d— distributed sparse and suppose that X 

Then, the solution X* to the optimization program 



X + N. 



mm. 

X 



X 



subject to AXB^ = AXB^ 



satisfies 



|X* -X||i < y 



12e 



46 



IXI 



(5) 



Proof. Let Q, be the support of X. To prove the result, we will consider the following chain 
of inequalities. 



\X* -X\\^ < 



(a) 2 
< 



X* -X 
4e 



+ 



l-4e 



X-X, 



X-X 
a + 



1 

X 



X 



2 


-4e 






2 


-4e 








< 


X-X 


H 




Xn-X 


+ 


X-X 


- 1 


-4e 




1 




-4e 


1 





(b) 5 - 12e 

< 

- l-4e 

(c) 5- 12e 
~ 1 -4e 



X-X 



1 • 



Here (a) follows from Theorem 2 since Q G Wd,p and {b) follows from the fact that 



Xn 
X. 



X 



< 



x-x 



since Xqc is Opxp- Finally, in (c) we merely plug in the definition of 

□ 
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1.2 The Rectangular Case and Higher Dimensional Signals 

While Theorem 1, as stated, apphes only to the case of square matrices X, we can extend our 
result in a straightforward manner to the rectangular case. Consider a matrix X G M^^^^^ 
where (without loss of generality) pi < P2- We assume that the row degree is dr (i.e. no 
row of X has more than dr non-zeros) and that the column degree is dc- Consider sketching 
matrices A G M^><^i and B G MJ^^P"^ and the sketching operation: 

Y = AXB^. 

Then we have the following corollary: 

Corollary 2. Suppose that X is distributed sparse with row degree dr and colum degree dc. 
Also, suppose that A G {0, ij^^^^i^ B G {0, ij^^^^ drawn independently and uniformly 
from the 5— random bipartite ensemble. Let us define p = max(]9i,]92) and d = max(rf^, rfc)- 
Then if 

m = (^(y^logp) and 5 = (^(logp), 

there exists a c > such that the optimal solution X* of (Pi) equals X with probability 
exceeding 1 — p~^. 

Proof. Let us define the matrix X G M^^^ as 



i.e. it is made square by padding additional zero rows. Note that X has degree d = 
max(rf^, dc). Moreover note that the matrix A G M^^^^i can be augmented to G W^^p via: 

A=[A A] 

where A G ]R^><(^~^i) is also drawn from the (^-random bipartite ensemble. Then one has the 
relation: 

Y = AXB^. 

Thus, the rectangular problem can be reduced to the standard square case considered in 
Theorem 1, and the result follows. 

□ 

The above result shows that a finer analysis is required for the rectangular case. For 
instance, if one were to consider a scenario where pi = 1, then from the compressed sensing 
literature, we know that the result of Corollary 2 is weak. We believe that determining the 
right scaling of the sketch dimension(s) in the case when X is rectangular is an interesting 
avenue for future work. 

Finally, we must also state that while the results in this paper only deal with two- 
dimensional signals, similar techniques can be used to deal with higher dimensional tensors 
that are distributed sparse. We leave a detailed exploration of this question to future work. 
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1.3 Applications 



It is instructive at this stage to consider a few examples of the framework we set up in this 
paper. These apphcations demonstrate that the modehng assumptions we make viz., tensor 
product sensing and distributed sparsity are important and arise naturally in a wide variety 
of contexts. 

1.3.1 Covariance Estimation from Compressed realizations 
or Covariance Sketching 

One particular application that will be of interest to us is the estimation of covariance 
matrices from sketches of the sample vectors. We call this covariance sketching. 

Consider a scenario in which the covariance matrix S G M^^^ of a high-dimensional 
zero-mean random vector ^ = (<Ci, • • • , Cp)"^ is to be estimated. In many applications of 
interest, one determines S by conducting correlation tests for each pair of covariates (^^, Q and 
computing an estimate of E[(^^(^j] for i, j = 1, . . . This requires one to perform correlation 
tests for 0{p^) pairs of covariates, a daunting task in the high-dimensional setting. Perhaps 
most importantly, in many cases of interest, the underlying covariance matrix may have 
structure, which such an approach may fail to exploit. For instance if S is very sparse, it 
would be vastly more efficient to perform correlation tests corresponding to only the non-zero 
entries. The chief difficulty of course is that the sparsity pattern is rarely known in advance, 
and finding this is often the objective of the experiment. 

In other settings of interest, one may obtain statistical samples by observing n indepen- 
dent sample paths of the statistical process. When ^ is high-dimensional, it may be infeasible 
or undesirable to sample and store the entire sample paths . . . , ^^^^ G M^, and it may be 
desirable to reduce the dimensionality of the acquired samples. 

Thus in the high-dimensional setting we propose an alternative acquisition mechanism: 
pool covariates together to form a collection of new variables Zi, . . . , Z^, where m < p. For 
example one may construct: 

and so on; more generally we have measurements of the form Z — where A G W^^^ and 
typically m <^ p. We call the thus newly constructed covariates Z — (Zi, . . . , Z^) a sketch 
of the random vector ^. 

More formally, the covariance sketching problem can be stated as follows. Let . . . , ^^""^ G 

MP be n independent and identically distributed ]9— variate random vectors and let S G MP^^ 
be their unknown covariance matrix. Now, suppose that one has access to the m— dimensional 
sketch vectors Z^'^^ such that 

Z» = i = l,2,...,n, 

where A G W^^^^m < p is what we call a sketching matrix. The goal then is to recover 
S using only {Z^^^}^^^. The sketching matrices we will focus on later will have randomly- 
generated binary values, so each element of Z^^^ will turn out to be a sum (or "pool") of a 
random subset of the covariates. 
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Notice that the sample covariance matrix computed using the vectors satisfies 
the foUowing. 



1=1 



where S^^^ := - ^^^i C^^HC^^^)"^ is the (maximum hkehhood) estimate of S from the samples 

To gain a better understanding of the covariance sketching problem, it is natural to first 
consider the stylized version of the problem suggested by the above calculation. That is, 
whether it is possible to efficiently recover a matrix S G MP^^ given the ideal covariance 
matrix of the sketches = AY^A^ G W^^^ . The analysis in the current paper focuses on 
exactly this problem and thus helps in exposing the most unique and challenging aspects of 
covariance sketching. 

The theory developed in this paper tells us that at the very least, one needs to restrict 
the underlying random vector ^ to have the property that each depends on only a few (say, 
d) of the other ^j's. Notice that this would of course imply that the true covariance matrix 
S will be rf— distributed sparse. Applying Theorem 1, especially the version in which the 
matrices A and B are identical, to this stylized situation reveals the following result. If A is 
chosen from a particular random ensemble and if one gets to observe the covariance matrix 
ATiAi^ of the sketch random vector Z = A^^ then using a very eflScient convex optimization 
program, one can recover S exactly. 

Now, suppose that ^ and A are as above and that we get to observe n samples Z^^^ = 
A^^'^\i = 1,2, ...,n of the sketch Z = A^. Notice that we can can consider S^^^ := 
n Sr=i C^^HC^^^)^ to be a "noise corrupted version" of S since we can write 

sw = s + (s(^)-s), 

where, under reasonable assumptions on the underlying distribution, 

Y,{n) _ ^ ^ Q g,^}]32ost surely as n ^ oc by the strong law of large of numbers. Therefore, 
1 

an application of Theorem 2 tells us that solving (Pi) with the observation matrix gives 
us an asymptotically consistent procedure to estimate the covariance matrix S from sketched 
realizations. 

We anticipate that our results will be interesting in many areas such as quantitative 
biology where it may be possible to naturally pool together covariates and measure interac- 
tions at this pool level. Our work shows that covariance structures that occur naturally are 
amenable to covariance sketching, so that drastic savings are possible when correlation tests 
are performed at the pool level, rather than using individual covariates. 
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The framework we develop in this paper can also be used to accomplish cross covariance 
sketching. That is, suppose that ^ and ( are two zero mean variate random vectors and 
suppose that G M^^^ is an unknown matrix such that [S^^^]^^. = E[^^Cj]. Let {^^^^^^^ 

and {C^^^}^^;^ be 2n independent and identically distributed random realizations of ^ and ( 
respectively. The goal then, is to estimate S^^^ from the m dimensional sketch vectors Z^^^ 
and Vl^(^) such that 

= A^W, i^W = ^C^') i = 1, 2, . . . , n, 

where A, B e W^^^, m < p. 

As above, in the idealized case. Theorem 1 shows that the cross-covariance matrix 
of ^ and ( can be exactly recovered from the cross-covariance matrix T^zw = ^^CC^^ 
the sketched random vectors W and Z as long as S^^^ is distributed sparse. In the case we 
have n samples each of the sketched random vectors, an application of Theorem 2 to this 
problem tells us that (Pi) is an efficient and asymptotically consistent procedure to estimate 
a distributed sparse from compressed realization of ^ and (. 

We note that the idea of pooling information in statistics, especially in the context of 
meta analysis is a classical one [^^]. For instance the classical Cohen's d estimate uses the 
idea of pooling samples obtained from different distributions to obtain accurate estimates 
of a common variance. While at a high level the idea of pooling is related, we note that 
our notion is qualitatively different in that we propose pooling covariates themselves into 
sketches and obtain samples in this reduced dimensional space. 

1.3.2 Graph Sketching 

Large graphs play an important role in many prominent problems of current interest; two 
such examples are graphs associated to communication networks (such as the internet) and 
social networks. Due to their large sizes it is difficult to store, communicate, and analyze 
these graphs, and it is desirable to compress these graphs so that these tasks are easier. The 
problem of compressing or sketching graphs has recently gained attention in the literature 
[1, 24]. 

In this section we propose a new and natural notion of compression of a given graph 
G = (V^E). The resulting "compressed" graph is a weighted graph G = (]/,£'), where V 
has a much smaller cardinality than V. Typically, G will be a complete graph, but the edge 
weights will encode interesting and valuable information about the original graph. 

Partition the vertex set 1/ = l/i U V2 U . . . U in the compressed graph (7, each partition 
Vi is represented by a node. (We note that this need not necessarily be a disjoint partition, 
and we allow for the possibility for Vi ilVj ^ 0.) For each pair Vi^Vj G V ^ the associated 
edge weight is the total number of edges crossing from nodes in Vi to the nodes in in the 
original graph G. Note that if an index k G Vi OVj^ the self edge (fc, k) must be included 
when counting the total number of edges between Vi and Vj. (We point out that the edge 
{Vk^ Vk) ^ E also carries a non-zero weight; and is precisely equal to the number of edges in 
G that have both endpoints in Vk- See Fig. 1 for an illustrative example.) 
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Define Ai to be the (row) indicator vector for the set 1^, i.e. 

A-. = f ^ 

^■^ \ otherwise 

If X denotes the adjacency matrix of G, then Y := AXA^ denotes the matrix representation 
of G. The sketch Y has two interesting properties: 

• The encoding faithfuUy preserves high-level "cut" information about the original graph. 
For instance information such as the weight of edges crossing between the partitions 
Vi and Vj is faithfully encoded. This could be useful for networks where the vertex 
partitions have a natural interpretation such as geographical regions; questions about 
the total network capacity between two regions is directly available via this method of 
encoding a graph. Approximate solutions to related questions such as maximum fiow 
between two regions (partitions) can also be provided by solving the problem on the 
compressed graph. 

• When the graph is bounded degree, the results in this paper show that there exists a 
suitable random partitioning scheme such that the proposed method of encoding the 
graph is lossless. Moreover, the original graph G can be unravelled from the smaller 
sketched graph G efficiently using the convex program (Pi). 




(a) (b) (c) 

Figure 1: An example illustrating graph sketching, (a) A graph G with 17 nodes (b) Parti- 
tioning the nodes into four partitions l^i, V2, V3, V4 (c) The sketch of the graph G. The nodes 
represent the partitions and the edges in the sketch represent the total number of edges of 
G that cross partitions. 

1.3.3 Multidimensional Signal Processing 

Multi-dimensional signals arise in a variety of applications, for instance images are naturally 
represented as two-dimensional signals /(•,•) over some given domain. 

Often it is more convenient to view the signal not in the original domain, but rather 
in a transformed domain. Given some one dimensional family of "mother" functions il^uit) 
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(usually an orthonormal family of functions indexed with respect to the transform variable 
u)^ such a family induces the transform for a one dimensional signal f{t) (with domain T) 
via 

f{u) = [ fmuit)- 

JteT 

For instance if ipu{t) := exp{—i27rut)^ this is the Fourier transform, and if ipu{t) is chosen to 
be a wavelet function (where u = (a^b)^ the translation and scale parameters respectively) 
this generates the well-known wavelet transform that is now ubiquitous in signal processing. 

Using ipuit) to form an orthonormal basis for one-dimensional signals, it is straightforward 
to extend to a basis for two-dimensional signal by using the functions ipu{t)ipy{r). Indeed, 
this defines a two-dimensional transform via 

fM= [ f{t,r)iJumvir). 

Similar to the one-dimensional case, appropriate choices of ^0 yeild standard transforms such 
as the two-dimensional Fourier transform and the two-dimensional wavelet transform. The 
advantage of working with an alternate basis as described above is that signals often have 
particularly simple representations when the basis is appropriately chosen. It is well-known, 
for instance, that natural images have a sparse representation in the wavelet basis (see Fig. 
2). Indeed, many natural images are not only sparse, but they are also distributed sparse, 
when represented in the wavelet basis. This enables compression by performing "pooling" 
of wavelet coefficients, as described below. 



Image Wavelet coefTlclents 




Figure 2: Note that the wavelet representaion of the image is distributed sparse. 

In many applications, it is more convenient to work with discrete signals and their trans- 
forms (by discretizing the variables (t, r) and the transform domain variables {u^v)). It 
is natural to represent the discretization of the two-dimensional signal f{t^r) by a ma- 
trix F G M^^^. The corresponding discretization of ipuit) can be represented as a matrix 
= [^]ut^ and the discretized version of the f{u^v)^ denoted by F is given by: 

F = "i^F^^. 



12 



As noted above, in several applications of interest, when the basis is chosen appro- 
priately, the signal has a succinct representation and the corresponding matrix F is sparse. 
This is true, for instance, when F represents a natural image and F is the wavelet transform 
of F. Due to the sparse representability of the signal in this basis, it is possible to acquire 
and store the signal in a compressive manner. For instance, instead of sensing the signal F 
using (which correpsonds to sensing the signal at every value of the transform variable u) , 
one could instead form "pools" of transform variables Si = {uii^Ui2 . . . ^Uik} and sense the 
signal via 



ueSi 



^ 7, 



where the matrix A corresponds to the pooling operation. This means of compression corre- 
sponds to "mixing" measurements at randomly chosen transform domain values u. (When 
is the Fourier transform, this corresponds to randomly chosen frequencies, and when is 
the wavelet, this corresponds to mixing randomly chosen translation and scale parameters) 
. When the signal F is acquired in this manner, we obtain measurements of the form: 

Y = AFA^, 



where F is suitably sparse. Note that one may choose different random mixtures of mea- 
surements for the t and r "spatial" variables, in which case one would obtain measurements 
of the form: 

Y = AFB^. 

The theory developed in this paper shows how one can recover the multi-dimensional signal 
F from such an undersampled acquisition mechanism. In particular, our results will show 
that if the pooling of the transform variable is done suitably randomly, then there is an 
efficient method based on linear programming that can be used to recover the original multi- 
dimensional signal. 



1.4 Related Work and Obstacles to Common Approaches 

The problem of recovering sparse signals via ii regularization and convex optimization has 
been studied extensively in the past decade; our work fits broadly into this context. In 
the signal processing community, the literature on compressed sensing [10, 2( ] focuses on 
recovering sparse signals from data. In the statistics community, the LASSO formulation as 
proposed by Tibshirani, and subsequently analyzed (for variable selection) by Meinshausen 
and Biihlmann [39], and Wainwright [ ] are also closely related. Other examples of struc- 
tured model selection include estimation of models with a few latent factors (leading to 
low-rank covariance matrices) [ ] , models specified by banded or sparse covariance matrices 
[7, 8], and Markov or graphical models [38, 39, 42]. These ideas have been studied in depth 
and extended to analyze numerous other model selection problems in statistics and signal 
processing [^^ i^t, i^]. 
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Our work is also motivated by the work on sketching in the computer science community; 
this hterature deals with the idea of compressing high-dimensional data vectors via projection 
to low-dimensions while preserving pertinent geometric properties. The celebrated Johnson- 
Lindenstrauss Lemma [os] is one such result, and the idea of sketching has been explored 
in various contexts 34]. The idea of using random bipartite graphs and their related 
expansion properties, which motivated our approach to the problem, have also been studied 
in past work [o, 35, 36]. 

While most of the work on sparse recovery focuses on sensing matrices where each entry 
is an i.i.d. random variable, there are a few lines of work that explore structured sensing 
matrices. For instance, there have been studies of matrices with Toeplitz structure [28], 
or those with random entries with independent rows but with possibly dependent columns 
[47, 44]. Also related is the work on deterministic dictionaries for compressed sensing [16], 
although those approaches yield results that are too weak for our setup. 

One interesting aspect of our work is that we show that it is possible to use highly 
constrained sensing matrices (i.e. those with tensor product structure) to recover the signal of 
interest. Many standard techniques fail in this setting. Restricted isometry based approaches 
[11] and coherence based approaches [18, 26, 46] fail due to a lack of independence structure 
in the sensing matrix. Indeed, the restricted isometry constants as well as the coherence 
constants are known to be weak for tensor product sensing operators [22, 32]. Gaussian 
width based analysis approaches [ ] fail because the kernel of the sensing matrix is not a 
uniformly random subspace and hence not amenable to a similar application of Gordon's 
("escape through the mesh") theorem. We overcome these technical difficulties by working 
directly with combinatorial properties of the tensor product of a random bipartite graph, 
and exploiting those to prove the so-called nuUspace property [i9, 15]. 

2 Experiments 

We demonstrate the validity of our theory with some preliminary experiments in this section. 
Figure 3 shows a 40 x 40 distributed sparse matrix on the left side. The matrix on the right 
is a perfect reconstruction using a sketch dimension of m = 21. 

Figure 4 is what is known now as the "phase transition diagram". Each coordinate 
(i, j) G {10, 12, ... , 60} X {2, 4, ... , 60} in the figure corresponds to an experiment with 
p = i and m = j. The value at the coordinate (i^j) was generated as follows. A random 
4-distributed sparse X G M^^^ was generated and a random A G W^'^ was generated as 
adjacency matrix of a graph as described in Definition 4. Then the optimization problem 
(Pi) was solved using the CVX toolbox [30, 25]. The solution X* was compared to X in the 
II '11^ norm (upto numerical precision errors). This was repeated 40 times and the average 
number of successes was reported in the (i, j'')-th spot. In the figure, the white region denotes 
success during each trial and the black region denotes failure in every single trial and there 
is a sharp phase transition between successes and failures. In fact, the curve that borders 
this phase transition region roughly looks like the curve p = j^rn^ which is what our theory 
predicts (upto constants and log factors). 
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Figure 3: The matrix on the left is a 40 x 40 sparse matrix and the matrix on the right is a 
perfect reconstruction with m = 21. 

We also ran some preliminary tests on trying to reconstruct a covariance matrix from 
sketches of samples drawn from the original distribution. To factor in the "noise" , we replaced 
the equality constraint in (Pi) with a constraint which restricts the feasible set to be the 




Figure 4: Phase transition plot. The {i^j)-th pixel shows (an approximation) to the proba- 
bility of success of the optimization problem (Pi) in recovering a distributed sparse X G W^'^ 
with sketch-size j. The (red) solid line shows the boundary of the phase transition regime 
and is approximately the curve p = jim'^ 
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Figure 5: The matrix on the left is a 40 x 40 distributed sparse matrix. The matrix on the 
right was reconstructed using n = 2100 samples and with sketches of size m = 21. 



set of all X such that 
validation. Figure 5 s 



AXA^ - X 



(n) 
Y 



< n instead. The parameter n was picked by cross- 



^hows a representative result which is encouraging. The matrix on 
the left is a 40 X 40 distributed sparse covariance matrix and the matrix on the right is a 
reconstruction using n = 2100 sketches of size m = 21 each. 



3 Preliminaries and Notation 

We will begin our theoretical discussion by establishing some notation and preliminary con- 
cepts that will be used through the rest of the paper. 

For any p G N, we define [p] := {1, 2, . . . ^p}. A graph G = {V^ E) is defined in the usual 
sense as an ordered pair with the vertex set V and the edge set E which is a set of 2-element 
subsets of V ^ i.e., E C (^). Henceforth, unless otherwise stated, we will deal with the graph 
G = {[p]^E). We also assume that all the graphs that we consider here include all the self 
loops, i.e., {i, i} G E for all i G [p]. For any S C [p], the set of neighbors N{S) is defined as 

NiS) = {je[p]:teS,{zj}eE}. 

For any vertex i G [p], the degree deg(i) is defined as deg(i) := |A^(i)|. 

Definition 1 (Bounded degree graphs and regular graphs). A graph G = {[p]^E) is said to 
be a bounded degree graph with (maximum) degree d if for all i G [p]^ 

deg{i) < d 

The graph is said to be rf— regular if deg{i) = d for all i G [p]. 

We will be interested in another closely related combinatorial object. Given p, m G N, 
a bipartite graph G = ([p], [m]^E) is a graph with the left set [p] and right set [m] such 
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that the edge set E only has pairs {i^j} where i is the left set and j is in the right set. A 
bipartite graph G = ([p], [m], E) is said to be 5— left regular if for all i in the left set [p], 
deg(i) = 5. Given two sets Ac [p]^ B C [m], we define the set 

E{A:B) :={{t^j)eE:ieA,jeB}, 

which we will find use for in our analysis. This set is sometimes known as the cut set. Finally 
for a set ^4 C [p] (resp. B C [m])^ we define N{A) := {j G [m] : i G A^ {hj} ^ E} (resp. 
Nji{B) := {i G [p] : j G B^ {hj} ^ ^} )• This distinction between and Nr is made only 
to reinforce the meaning of the quantities which is otherwise clear in context. 

Definition 2. (Tensor graphs) Given two bipartite graphs Gi = ([p], [m], £"1) and G2 = 
([p], [m],£'2); we define their tensor graph Gi®G2 to be the bipartite graph {[p] x [p], [m] x 
[m], El E2) where Ei E2 is such that {(i, i'), (i,/)} ^ Ei® E2 if and only if {i, j} G Ei 
and {i',j'} G E2. 

Notice that if the adjacency matrices of Gi, G2 are given respectively by AJ- ^ B^ G MP^^^ 
then the adjacency matrix of Gi <S) G2 is {A (g) B)^ G MP^x^^ 

As mentioned earlier, we will be particularly interested in the situation where B = A. 
In this case, write the tensor product of a graph G = ([p], [m], £") with itself as G^ = 
{[p] X [p], [m] X [m],E^). Here is such that {(i, i'), (jj^)} G £^ if and only if {ij} and 
{z^,/} are in E. 

Throughout this paper, we write || • || to denote norms of vectors. For instance, ||x||i and 
||x||2 respectively stand for the £1 and £2 norm of x. Furthermore, for a matrix A, we will 
often write ||A|| to denote ||vec(A)|| to avoid clutter. Therefore, the Frobenius norm of a 
matrix X will appear in this paper as ||A||2. 

3.1 Distributed Sparsity 

As promised, we will now argue that distributed sparsity is important. Towards this end, let 
us turn our attention to Figure 6 which shows two matrices with 0{p) non-zeros. Suppose 
that the non-zero pattern in X looks like that of the matrix on the left (which we dub as 
the "arrow" matrix). It is clear that it is impossible to recover this X from AXB^ even if 
we know the non-zero pattern in advance. For instance, if G ker(74), then the matrix A, 
with V added to the first column of A is such that AXB^ = AXB^ and X is also an arrow 
matrix and hence indistinguishable from A. Similarly, one can "hide" a kernel vector of B 
in the first row of the arrow matrix. In other words, it is impossible to uniquely recover A 
from AXB^. 

In what follows, we will show that if the sparsity pattern of A is more distributed^ as in 
the right side of Figure 6, then one can recover A and do so efficiently. In fact, our analysis 
will also reveal what that the size of the sketch Y needs to be able to perform this task and 
we will see that this is very close to being optimal. 

In order to make things concrete, we will now define these notions formally. 
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Arrow" matrix Distributed sparse matrix 



Figure 6: Two matrices with 0{p) non-zeros. The "arrow" matrix is impossible to recover 
by covariance sketching while the distributed sparse matrix is. 

Definition 3 (rf— distributed sets and rf— distributed sparse matrices). We say that a subset 
ft C [p] X [p] is (i— distributed if the following hold. 

1. Fori = 1,2, . . . ,p, {i,i) G Vt. 

2. For all k G [p], the cardinality of the sets Qk '= {{hj) ^ ^ ' i = k} and := 
{(i, j) G Q : j = k} is no more than d. 

The set of all d— distributed subsets of [p] x [p] will be denoted as Wp^d- We say that a matrix 
X G is rf— distributed sparse if there exists an Q G ^d,p such that supp{X) := 

{(^,i)G[p]x[p]:Xi, 7^0}cr2. 

While the theory we develop here is more generally applicable, the first point in the above 
definition makes the presentation easier. Notice that this forces the number of ofl[-diagonal 
non-zeros in each row or column of a rf— distributed sparse matrix X to be at most d—1. This 
is not a serious limitation since a more careful analysis along these lines can only improve 
the bounds we obtain by at most a constant factor. 
Examples: 

• Any diagonal matrix is rf— distributed sparse for = 1. Similarly a tridiagonal matrix 
is rf— distributed sparse with = 3. 

• The adjacency matrix of a bounded degree graph with maximum degree — 1 is 
rf— distributed sparse 

• As shown in Proposition 1, random sparse matrices are rf— distributed sparse with 
d = 0{\ogp). While we implicitly assume that d is constant with respect to p in what 
follows, all arguments work even if d grows logarithmically in p as is the case here. 

Given a matrix X G M^^^, as mentioned earlier, we write vec(X) to denote the 
vector obtained by stacking the columns of X. Suppose x = vec(X). It will be useful in 
what follows to remember that x was actually derived from the matrix X and hence we 
will employ slight abuses of notation as follows: We will say x is rf— distributed sparse when 
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we actually mean that the matrix X is. Also, we will write {i^j) to denote the index of x 
corresponding to X^j, i.e., 

We finally make two remarks: 

• Even if X were distributed sparse, the observed vector Y = AXB^ is usually unstruc- 
tured (and dense). 

• While the results in this paper focus on the regime where the maximum number of 
non-zeros d per row/column of X is a constant (with respect to p), they can be readily 
extended to the case when d grows poly-logarithmically in p. Extensions to more 
general scalings of d is an interesting avenue for future work. 

3.2 Random Bipartite Graphs, Weak Distributed Expansion and 
the Choice of the Sketching Matrices 

As alluded to earlier, we will choose the sensing matrices yl, 5 to be the adjacency matrices 
of certain random bipartite graphs. The precise definition of this notion follows. 

Definition 4 (Uniformly Random 5— left regular bipartite graph). We say that G = ([p], [m], E) 
is a uniformly random 5— left regular bipartite graph if the edge set E is a random variable 
with the following property: for each i G [p] one chooses 6 vertices ^1,^2, • • • chosen uni- 
formly and independently at random (with replacement) from [m] such that {{i^jk}}k=i ^ ^• 

Remarks: 

• Note that since we are sampling with replacement, it follows that the bipartite graph 
thus constructed may not be simple. If for instance there are two edges from the left 
node i to the right node the corresponding entry Aij = 2. 

• It is in fact possible to work with a sampling without replacement model in Definition 4 
(where the resulting graph is indeed simple) and obtain qualititatively the same results. 
We work with a "sampling with replacement" model for the ease of exposition. 

The probabilistic claims in this paper are made with respect to this probability distribution 
on the space of all bipartite graphs. 

In past work [5, 35], the authors show that a random graph generated as above is, for 
suitable values of 6,5, a (fc, e)— expander. That is, for all sets S C [p] such that \S\ < fc, 
the size of the neighborhood |A^(5')| is no less than (1 — 6)5|S'|. If A is the adjacency 
matrix of such a graph, then it can then be shown that this implies that £1 minimization 
would recover a /c— sparse vector x if one observes the sketch Ax (actually, [ ] shows that 
these two properties are equivalent). Notice that in our context, the vector that we need to 
recover is 0{p) sparse and therefore, our random graph needs to be a (0(p), e)— expander. 
Unfortunately, this turns out to not be true of Gi ® G2 when Gi and G2 are randomly chosen 
as directed above. 
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However, we prove that if Gi and G2 are picked as in Definition 4, then their tensor graph 
Gi®G2'> satisfies what can be considered a weak distributed expansion property. This roughly 
says that the neighborhood of a distributed C [p] x [p] is large enough. Moreover, we 
show that this is in fact sufficient to prove that with high probability X can be recovered 
from AXB^ efficiently. The precise statement of these combinatorial claims follows. 

Lemma 1. Suppose that Gi = ([p], [m], £"1) and G2 = ([p], [m], £"2) are two independent 
uniformly random 6— left regular bipartite graphs with 6 = 0{logp) and m = 0{^/dplogp). 
Let ^} G Wd^p be fixed. Then there exists an e ^ (O, |) such that Gi G2 has the following 
properties with probability exceeding 1 — p~^^ for some c > 0. 

1. \N{n)\ >p5\l-e). 

2. For any {i,i') G {[p] x [p]) \ n we have \N{i,i') D N{n)\ < e6^. 

3. For any {i,i') G Vt, \N{i,i') r\ N{Vt\{i,i'))\ < e5^ . 

Moreover^ all these claims continue to hold when G2 is the same as Gi. 
Remarks: 

• Part 1 of Lemma 1 says that if is a rf— distributed set, then the size of the neghborhood 
of Q is large. This can be considered a weak distributed expansion property. Notice 
that while it is reminiscent of the vertex expansion property of expander graphs, it is 
easy to see that it does not hold if Q is not distributed. Furthermore, we call it "weak" 
because the lower bound on the size of the neighborhood is 6'^p{l — e) as opposed to 
5^ \Q\ (1 — e) = 6'^dp{l — e) as one typically gets for standard expander graphs. It 
will become clear that this is one of the key combinatorial facts that ensures that the 
necessary theoretical guarantees hold for covariance sketching. 

• Parts 2 and 3 say that the number of collisions between the edges emanating out of a 
single vertex with the edges emanating out of a distributed set is small. Again, this 
combinatorial property is crucial for the proof of Theorem 1 to work. 

As stated earlier, we are particularly interested in the challenging case when Gi = G2 (or 
equivalently, their adjacency matrices B are the same). The difficultly, loosely speaking, 
stems from the fact that since we are not allowed to pick Gi and G2 separately, we have 
much less independence. In Appendix A, we will only prove Lemma 1 in the case when 
Gi = G2 since this proof can be modified in a straightforward manner to obtain the proof 
of the case when Gi and G2 are drawn independently. 

4 Proofs of Main Results 

In this section, we will prove the main theorem. To reduce clutter in our presentation, we 
will sometimes employ certain notational shortcuts. When the context is clear, the ordered 
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pair (i^i^) will simply be written as ii^ and the set [p] x [p] will be written as [p]^. Sometimes, 
we will also write A to mean A and if S' C [p] x [p], we write {A(S) A)s or As to mean 
the submatrix of A® A obtained by appending the columns {Ai Aj \ (i^j) G S}. 

As stated earlier, we will only provide the proof of Theorem 1 for the case when A = B. 
Some straightforward changes to the proof presented here readily gives one the proof for the 
case when the matrices A and B are distinct. 

4.1 Proof of Theorem 1 

We will consider an arbitrary ordering of the set [p] x [p] and we will order the edges in 
lexicographically based on this ordering, i.e., the first 5^ edges ei, . . . , 6^2 in are those 
that correspond to the first element as per the ordering on [p] x [p] and so on. Now, one 
can imagine that the graph is formed by including these edges sequentially as per the 
ordering on the edges. This allows us to partition the edge set into the set Ef of edges that 
do not collide with any of the previous edges as per the ordering and the set Ef := E^ — Ef. 
(We note that a similar proof technique was adopted in Berinde et al. [5]). 

As a first step towards proving the main theorem, we will show that the operator A® A 
preserves the £i norm of a matrix X as long as X is distributed sparse. Berinde et al.,[ ] call 
a similar property RIP-1, taking cues from the restricted isometry property that has become 
popular in literature [ ]. The proposition below can also be considered to be a restricted 
isometry property but the operator in our case is only constrained to behave like an isometry 
for distributed sparse vectors. 

Proposition 2 (-^i-RIP). Suppose X G is d— distributed sparse and A is the adjacency 
matrix of a random bipartite 6— left regular graph. Then there exists an e > such that 

(1 - 2e)S^ \\X\\^ < \\AXA^\l < 5^ \\X\\^ , (6) 

with probability exceeding 1 — p~^ for some c> 0. 

Proof. The upper bound follows (deterministically) from the fact that the induced (matrix) 
^i-norm of ^4 ^4, i.e., the maximum column sum of A <S) A^ is precisely 5^. To prove the 
lower bound, we need the following lemma. 

Lemma 2. For any X G W^^, 

\\AXA'l>5'\\X\\,-2 J2 hu',meEf\X^e\ (7) 

Proof In what follows, we wiU denote the indicator function l^u^jj^y^s by if^'^"^ We be gin 
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by observing that 

II^X^'^II^ = \\Ayec{X% 



jj'e[m]'^ 

= E 

jj'e[m]'^ 
jj'e[m]'^ 
jj'e[m]'^ 



^ ^ 1 -Xii' 



n'e[p]2 



{"'ii'} 



m'G[p]2 



El {"'ii'} Y 



(a) 
> 



E ( E 



-^77;^ 



-^ii^ I -> 



ii' ^[p]^ j j' ^[mY 



where (a) follows after observing that the first (double) sum has only one term and applying 
triangle inequality to the second sum. Since 

^ii'^W,33'^[m]^ ^^V"^ ^ = W^Wv "^^^^ coucludcs the proof of the lemma. □ 

Now, to complete the proof of Proposition 2, we need to bound the sum in the LHS of 
(7). Notice that 

= ^ \Xiif \ Tii' = ^ \Xiif \ Tiif. 

where rn' is the number of collisions of edges emanating from ii' with all the previous edges 
as per the ordering we defined earlier. Since Vt is rf— distributed, from the third part of 
Lemma 1, we have that for all ii' ^ Vt^ rn' < eS^ with probability exceeding 1 — and 
therefore, 

\Xiif\riif < ||X||^ . 



'en 



This concludes the proof. 



□ 



Next, we will use the fact that A ® A behaves as an approximate isometry (in the £1 
norm) to prove what can be considered a nuUspace property [19, L^]. This will tell us that 
the nuUspace of A A is "smooth" with respect to distributed support sets and hence £1 
minimization as proposed in (Pi) will find the right solution. 
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Proposition 3 (NuUspace Property). Suppose that A G {0,1}^^^ is the adjacency matrix 
of a random bipartite 5— left regular graph with 5 = 0{logp) and m = 0{y/dplogp) and that 
Q G Wd^p is fixed. Then, with probability exceeding 1 — p~^, for any V G such that 

AVA^ = 0, we have 

\mr<Y^W^4r^ (8) 
for some e G (0, |) and for some c > 0. 

Proof. Let V be any symmetric matrix such that AVA^ = 0. Let v = vec(V) and note that 
veci^AVA^^ = {A®A)v = 0. Let be a rf-distributed set. As indicated in Section 3, 
we define N{Q) C [m]^ to be the set of neighbors of Q with respect to the graph G^. Let 
(^4 ® ^)^(^) denote the submatrix of A <S) A that contains only those rows corresponding 
to N{Q) (and aU columns). We will slightly abuse notation and use vq to denote the 
vectorization of the projection of V onto the set i.e., vq = vec(VQ)^ where 



Vij {i,j)en 

otherwise 



Now, we can follow the following chain of inequalities: 

O=\\{A0Af^''^v\l 
= \\iA®Af^''\vn + vnc)\l 

= \\{A^A)va\\,-\\{A®Af^''')va4, 
>{l-2e)S'\\Vn\\,-\\{A®Af(''^vncl 



where the last inequality follows from Proposition 2. Resuming the chain of inequalities, we 
have: 

> {1 - 2e)5' \\Va\\, - J] \\{A ^ Af^^^^wyl 



> (1 - 2e)S' \\Vn\\, - J] \Vu'\ 

jj'eN{n),ii'en'' 
= (1 - 2e)S' \\Va\\, - J2 • 1^- 



(a) 



>{l-2e)5'\\Vn\\,- J] e^^ \Vu 



>{l-2e)S'\\Va\\,-eS'\\V\\,, 

where (a) follows from the second part of Lemma 1. Writing ||l^||i = ||^||i + H^^i^li and 
rearranging, we get the required result. □ 
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Now, we can use this to prove our main theorem. 

Proof of Theorem 1. Let Q be the support of X and notice that Q is rf— distributed. Now, 
suppose that there exists an X 7^ X such that AXA^ = Y. Observe that A (^X — X^ A^ = 
0. Now, consider 



i^iii< 



X-Xn 
X -X 



Xn 



(a) 
< 



< 



1 -3e 

e 

1 -3e 



X 



X- 

-Xqc 



1 



X 



+ 



Xc 



where (a) foUows from Proposition 3, and the last hne foUows from the fact that e < ^, again 
from Proposition 3. Therefore, the unique solution of (Pi) is X with probability exceeding 
1 — j9~^, for some c > 0. □ 



5 Conclusions 

In this paper we have introduced the notion of distributed sparsity for matrices. We have 
shown that when a matrix is X distributed sparse, and ^4, B are suitable random binary 
matrices, then it is possible to recover X from under-determined linear measurements of the 
form Y = AXB^ via ii minimization. We have also shown that this recovery procedure is 
robust in the sense that if X is equal to a distributed sparse matrix plus a perturbation, 
then our procedure returns an approximation with accuracy proportional to the size of the 
perturbation. Our results follow from a new lemma about the properties of tensor products 
of random bipartite graphs. We also describe three interesting applications where our results 
would be directly applicable. 

In future work, we plan to investigate the statistical behavior and sample complexity 
of estimating a distributed sparse matrix (and its exact support) in the presence of vari- 
ous sources of noise (such as additive Gaussian noise, and Wishart noise). We expect an 
interesting trade-off between the sketching dimension and the sample complexity. 
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Appendix A 
Proof of Lemma 1 

Lemma 1. Suppose that Gi = {[p\,[m],Ei) and G2 = ([p], [m], £"2) are two independent 
uniformly random 8— left regular bipartite graphs with 8 = 0{logp) and m = 0{^/dplogp). 
Let Q G Wd^p be fixed. Then there exists an e ^ (O, |) such that Gi (8) G2 has the following 
properties with probability exceeding 1 — p~^, for some c > 0. 

L \N{n)\ >p8\l-e). 

2. For any {i,i') G {[p] x [p]) \ Q we have \N{i,i') n N{Q)\ < e^^ 

3. For any {i,i') G \N{i,i') n N{n\ {i,i'))\ < e8\ 

Moreover, all these claims continue to hold when G2 is the same as Gi. 

Proof As stated earlier, we will only prove this lemma for the case when Gi = G2. With a 
few minor modifications, one can readily get a proof for the easier case when Gi and G2 are 
drawn independently. 

Let £1,62 and, £s respectively denote the events that the implications (1), (2) and, (3) 
are true. Notice that 

P U £^ U < F{£^) + F{£^) + P(^3^) 

= P(^^) + F{£^ I ^i)P(^i) + P(^2' I ^iM^i) 

+ p(^3^ I ^i)p(^i) + p(^3^ I ^r)p(^r) 

< 3P(^^) + F{£^ \£i)+ F{£^ I £1) 
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Our strategy will be to upper bound P(£^f),P(£^| | 8i) and, P(^^| | Si). Suppose the 
bounds were pi,p2 and, ps respectively, then it is easy to see that 

P{^in52nf3} > l-max{3pi,p2,P3}. (9) 

Part 1 . We will first show that ¥{Si) is small. Since Q is rf— distributed, the "diagonal" set 
V :— {(1, 1), . . . , is a subset of Q. Now, notice that for j ^ / G [m], i G [p], 

F[0^,/)eA^((^,^))] = (10) 

This implies that 

5(5-1) V^' 



\ m(m — Ij / 
Therefore, we can bound the expected value of |A/'(Q)| as follows. 

E[|iV(0)|] >E[|iV(P)|] 

jj'e[m\x[m\, 

\ V m(m — 1) / 

jj'e[m]x[m], \ ^ ^ 

S{S - 1) ^ 1^'' 



= m(m - 1) ( 1 - ( 1 - 

> m{m — 1) 



m(m — 1) 
V\Si6-l) iVfS^S-lf 



m(m — 1) m?{m — iy 



5 m(m — 1) 



= p5^(l-6') 



Where in the last step, we set e' = \ + . 

m(m— 1) 

To complete the proof, we must show that the random quantity |A^(f^)| cannot be much 
smaller than — e'). As a first step, we define the random variables Xjj' •— l{(j,jO^^(^)} 

and notice that the following chain of inequalities hold 



|7V(0)| > \N{V)\ > E ^^r- 



jj'e[m]x[m] 
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Therefore, we have that 



J2 Xjf < pSHl -e'- e") 



jj'e[m\x[m\ 



Also, since by above, E 



Xlj/j/ Xjf ^ pS'^{l - e'), we have that 



P[|A^(f^)| <p5\l-e)] <P 



E 



< E 



jj'elm]x[m\ 



ii'G[m]x[m] 



p5 



Now, notice that the sum ^j^j> Xjj' has m(m — 1) terms and each term in the sum is 
dependent on no more than 2m — 4 terms. Therefore, one way to bound the required quantity 
is to extract independent sub-sums from the above sum and bound the deviation of each of 
those from their means (which is the corresponding sub-sum of the mean) . A principled way 
of doing this is suggested by the celebrated Hajnal-Szemeridi theorem [ , ]. Consider 
a graph on the vertex set [m] x [m] \ {(1, 1), . . . , (m, m)} where there is an edge between 
vertices (j,/) and {ji^j'i) if j = ji and/or / = i.e., exactly when the random variables 
Xjj' and Xjij[ are dependent. Since this graph has degree B(m), Hajnal-Szemeridi theorem 
tells us that this graph can be equitable colored with B(m) colors. In other words, the above 
sum can be partitioned into B(m) sub-sums such that each sub-sum has B(m) elements and 
the random variables in each of them are independent. Along with this and the fact that 
m{m — 1) > p^^, we can use the union bound and write 



P 



jfe[m]x[m] 



J2 



2 m 



< e(m) P 



^ 2^ Xjf < E ^ 2^ Xn' 



— e 



where Ci is one of the "colors". Notice that |Ci| = B(m). 
FinaUy, using Chernoff bounds, we have 



e(m) P 



1 1 ^ 

iTm 2^ < \r~\^ 



J2 ^ii' 


-e" 






Jj'eCi 










< e(m) 


exp{-2e"2e(m)} . 


3 possible, 


settinj 


g e := 


e' + e" yeilds pi < p~'^^ 



some Ci > 0. This technique of generating large deviation bounds when one has limited 
dependence is not new, see [±i]. 
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Part 2: Now, we bound P(£^| | Si). Associated to a fixed i one can imagine 5 independent 
random trials that determine the outgoing edges from i. In a similar way there are S inde- 
pendent random trials associated to the outgoing edges of i\ Let us fix (i, i') and investigate 
the outgoing edge (in the tensor graph) determined by the first trial of i and the first trial of 
i\ The probability that this edge emanating from the vertex (i, i^) G [p]^ \ {(1, 1), . . . , (p^p)} 
hits an arbitrary vertex (j,/) G [m]^ is given by l/m?. The probability that this edge lands 
in N{^) is, therefore, given by |A^(f^)| /m^. Since there are 5^ edges that are incident on 
the expected size of overlap between N{i^i') and N{Q) is upper bounded by 



Again, to show concentration, we employ similar arguments as before and define indicator 
random variables Xi • • • ? X6'^ edich of which corresponds to one of the edges emanating from 
the vertex (i, i') and then observing that the sum '}2k=i Xk is precisely equal to the random 
quantity \N{i^i') D N{Q)\. To conclude that this random quantity concentrates, we first 
observe that, as above, the S'^ dependent terms can be divided up into Q{S) with Q(S) 
elements each such that in each set the terms are independent. Therefore, we have 



\N{i,i^)nN{Q)\ >5' 



(1 + e') 



= p 



52 



,fc=l 



< 0{5) P 



IkeCi 



< e{5) exp I -S 



where Ci is one of the colors. 

Therefore, since conditioned on Si, \N{Q)\ > 5'^p{l — e) if we pick m = b\fd!p, and 
5 = e(logp) there is a ^ = 4(e') > 2 such that, n N{^)\ > ^^^^(i + ^z)^ ^ith 

probability not exceeding p~^''^. Setting e = (1 + ^0 |A^(^)| /^^, picking m as prescribed, and 
taking union bound over (i, i^) G f^^, we get p2 < p~^'^ for some C2 > 0. 

Part 3: Next, we bound P(^| | Si). Notice that the proof is very similar to that of part 2 
when i ^ i' . So, here we will consider the quantity |A^(i, i) n N{Vt \ {(i, i)})\. 

As explained above to each left node i we associate 5 random trials that determine its 
outgoing edges. Correspondingly, if we fix a left node in the tensor graph, and think 
of its outgoing edges they are determined by the outcome of 5^ product trials. Let lk{j) be 
the indicator function of the event that in the k^^ trial of i the outgoing edge is incident on 
j. The probability that the edge associated to the (/c,/) trial associated to is incident 
on (j,/) is the random variable Note that = 1^0) if j = j' and 

otherwise. 
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Note that 



s s 

\N{t, i) n N{n \ {i, 0)1 = ^Y1 E Mj)Uf) 

k=i 1=1 (j,j')eN{n\{i,i)) 

k^l {j,j')(^N{n\(i,i)) 

When k ^ I, the trials corresponding to lk{j), 1(0') independent, and hence Klk{j)li{f) = 
^. We define Xk,i ■= J2(j,j')eN(a\{i,i)) Uij)hij') and note that E{xk,i) = ^^^2'''''^ • We also 
note that Xk,i is binary valued, and 

\N{z,t)nN{n\{z,t))\<5 + J2xk,i. 

k^l 



Therefore, 



E(|Af(i,i) n/V(SJ \ <s+(S^- <i) ^'"^\i'''*' 



5 V 5 ) m? 



Next we need to prove that the quantity of interest "^j^-^i Xk,i concentrates about its 
mean. To that end we note that these binary valued variables are such that any particular 
Xk,i is dependent on at most 26 — 2 other variables. Using Chernoff concentration bounds 
in conjunction with the Hajnal-Szemeredi based coloring argument explained in part 1 of 
this proof, followed by a union bound over i G [p] we obtain the required probability bounds 
Ps < P~^^ for some C3 > 0. 

Substituting the bounds for Pi,P2,P3 back into (9) concludes the proof. □ 

Appendix B 

Proof of Proposition 1 

Proposition 1. Consider a random matrix X G such that Xij Ber(7) where p^ = 

A = ©(1); then for any e > 0^ X is d— distributed sparse with probability at least l — e, where 

d = A I 1 + a^^sMll 
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Proof. Let X^, i = 1, . . . , p denote the sparsity of the z— th column and let X^, i = . . . , 2]9, 
denote the sparsity of the i—th row. Notice that the 2p random variables Xi, X2, . . . , are 
(dependent) Bin(]9, 7) random variables. With the choice of d as indicated in the theorem, 
we have the following, 

W j P'A\ 21og(2p/e) 



w r /3A 

< exp 



2p 



2 + /5j' A 
2" 



where (a) follows from the multiplicative form of the Chernoff Bound [ ] and (b) follows as 
long as /9 > 2. The rest of the proof follows from a simple application of the union bound. □ 



Appendix C 
Proof of Theorem 2 

Theorem 2. Suppose that X is a p x p matrix. Furthermore, suppose that the hypotheses 
of Theorem 1 hold and let X* be the solution to the optimization program (Pi). Then, there 
exists a c > and an e £ (0,1/4) such that the following holds with probability exceeding 
1 — p~^. 

IIX* -XIL < ( min ||X-Xq|L). (11) 

Proof. Since X* is the optimum of the optimization program (Pi), we have that ||X||^ > 
II X* lip Let VL* be such that \\X — Xq* \\i = mins^ggudp 11^ ~ ^^11- We can proceed as follows 



ll^lli > ll^lli (12) 

= ||(X + X* - X)^\\, + ||(X + X* - X)^4, (13) 

> llXnIli - ||(X* - XU\, + ||(X* - X)nc||i - \\X^A\i (14) 
= ||X||, - 2 ||Xnc||, + ||X* - X||, - 2 ||(X - X*)n\\, (15) 

> ||X||, - 2 llX^cll, + \\X* - X\\, (16) 

where in the last step, we have used the fact that since X* is a feasible point in (Pi), 

AX*B^ = AXB^ and therefore, we can apply the result of Proposition 3 to X* — X. This 

completes the proof. □ 



33 



